feat: add StringUtils, DateTimeUtils, PathUtil, OptionsUtils, RapidJsonUtil, and Jsonizable utilities#23
Conversation
…onUtil, and Jsonizable utilities
leaves12138
left a comment
There was a problem hiding this comment.
Thanks for adding the utility modules. I found a blocking issue in PathUtil: NormalizePath currently turns valid URI paths without an authority, such as file:///tmp or hdfs:///warehouse, into file:/tmp / hdfs:/warehouse. In URI syntax the triple slash is significant here because it means an empty authority plus an absolute path; collapsing it changes the URI shape and can break filesystem/catalog path handling. Please preserve the scheme:///path form when the input has an explicit empty authority, and add tests for file:///.../hdfs:///... normalization.
leaves12138
left a comment
There was a problem hiding this comment.
Correction for my previous NormalizePath comment: I rechecked Apache Paimon Java Path.toString(). For new Path("file:///tmp") / new Path("hdfs:///warehouse"), the parsed authority is null and toString() emits file:/tmp / hdfs:/warehouse, not the triple-slash form. Since this C++ implementation returns Path::ToString() and matches that Java behavior, the NormalizePath URI-shape issue I raised should be ignored. No change is needed for that part.
leaves12138
left a comment
There was a problem hiding this comment.
Thanks for the update. I re-reviewed PR #23 after correcting my previous NormalizePath concern. The path normalization behavior is compatible with Java Paimon and does not need to change. I found two other issues that should be fixed before merging: date/timestamp parsing currently accepts impossible dates by letting libc normalize them, and JSON deserialization of narrow integer types can silently truncate out-of-range values.
| if (ss.fail()) { | ||
| return Status::Invalid(fmt::format("failed to convert string '{}' to date", str)); | ||
| } | ||
| std::time_t time = timegm(&timeinfo); |
There was a problem hiding this comment.
This accepts invalid calendar dates because std::get_time only fills tm, and timegm/mktime normalizes out-of-range fields instead of rejecting them. For example, StringToDate("2019-02-29") will become 2019-03-01, and StringToTimestampMillis("2023-02-30 00:00:00") will become 2023-03-02 00:00:00. Java Paimon's DateTimeUtils.parseDate explicitly validates month/day ranges and rejects invalid dates, so this can persist or compare wrong metadata values. Please validate the parsed year/month/day before converting, or compare the normalized tm back to the parsed fields, and add tests for non-leap Feb 29 / Feb 30.
leaves12138
left a comment
There was a problem hiding this comment.
One more inline issue for the JSON integer deserialization path.
| if (!value.IsInt()) { | ||
| throw std::invalid_argument("value must be int"); | ||
| } | ||
| return static_cast<int8_t>(value.GetInt()); |
There was a problem hiding this comment.
The small-integer JSON readers only check RapidJSON's broad IsInt() / IsUint() predicates, then cast to int8_t / uint8_t / int16_t / uint16_t. This silently wraps out-of-range JSON values, e.g. deserializing 500 as int8_t yields a truncated value instead of failing. Since StringUtils::StringToValue already rejects the same overflow cases, JSON deserialization should be consistent: check value.GetInt() / GetUint() against std::numeric_limits<T>::min()/max() before casting, and add overflow tests for the narrow integer types.
Purpose
No Linked issue.
Introduce string/JSON/path utility modules to the apache/paimon-cpp codebase. These modules provide common string manipulation, date-time parsing, path operations, configuration option parsing, and JSON serialization helpers.
Introduce DateTimeUtils for timestamp parsing and formatting (date_time_utils.h)
Introduce TimezoneGuard test utility for timezone-dependent tests (timezone_guard.h)
Introduce StringUtils for string splitting, trimming, type conversion, and formatting (string_utils.h, string_utils.cpp)
Introduce RapidJsonUtil for JSON document read/write helpers (rapidjson_util.h)
Introduce PathUtil for file path manipulation and generation (path_util.h, path_util.cpp)
Introduce OptionsUtils for parsing configuration key-value options (options_utils.h)
Introduce Jsonizable CRTP base for JSON serialization/deserialization (jsonizable.h)
Tests
date_time_utils_test.cpp — Timestamp parsing, formatting, timezone handling
string_utils_test.cpp — String split, trim, conversion, and format operations
rapidjson_util_test.cpp — JSON document creation, read, and write
path_util_test.cpp — Path join, normalize, and UUID-based temp path generation
options_utils_test.cpp — Configuration option key-value parsing
jsonizable_test.cpp — JSON serialization and deserialization round-trip
API and Format
No API or format changes.
Documentation
No documentation changes
Generative AI tooling